Comparing a TBL Tagger with an HMM Tagger: Time Efficiency, Accuracy, Unknown Words
Abstract
This paper compares a transformation-based learning (TBL) tagger with a hidden Markov model (HMM) tagger, using the Brill tagger and the TnT tagger respectively. The Dutch Spoken Corpus (CGN), tagged with a medium-sized tagset of 72 tags, serves as training and test material. The TnT tagger outperforms the Brill tagger on larger tagsets and when relatively small training sets (around 10,000 sentences) are used. The results support the observation by Schneider and Volk that the unknown-word handling of the Brill tagger is still unsatisfactory. While both taggers achieve roughly the same tagging speed, the Brill tagger takes a long time to train: around 18 hours for a training set of 100,000 sentences tagged with the 72-tag tagset. The TnT tagger needs slightly less than nine seconds for the same set and slightly less than a minute and a half for a set of 900,000 sentences. The best result for the Brill tagger on the medium-sized tagset was an error rate of 3.9% with a training set of 100,000 sentences. The best result for the TnT tagger on the medium-sized tagset was an error rate of 2.7% with a training set of 900,000 sentences.
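To put the figures quoted in the abstract side by side, the short Python sketch below computes the training-time ratio and the relative error reduction. The function and constant names are illustrative and do not come from the paper. The timings are taken as the rounded values stated above, so the results are approximations. Note also that the two best error rates were obtained with training sets of different sizes, so the error comparison is indicative only.

```python
# Figures as quoted in the abstract. "Around 18 hours" and "slightly less
# than nine seconds" are treated as 18 h and 9 s, so results are approximate.
BRILL_TRAIN_SECONDS = 18 * 60 * 60   # Brill tagger, 100,000 sentences
TNT_TRAIN_SECONDS = 9                # TnT tagger, same 100,000 sentences
BRILL_ERROR_RATE = 3.9               # percent, 100,000 training sentences
TNT_ERROR_RATE = 2.7                 # percent, 900,000 training sentences


def training_speedup(slow_seconds: float, fast_seconds: float) -> float:
    """Return how many times longer the slow trainer runs than the fast one."""
    return slow_seconds / fast_seconds


def relative_error_reduction(baseline: float, improved: float) -> float:
    """Return the error reduction relative to the baseline, in percent."""
    return (baseline - improved) / baseline * 100.0


if __name__ == "__main__":
    speedup = training_speedup(BRILL_TRAIN_SECONDS, TNT_TRAIN_SECONDS)
    reduction = relative_error_reduction(BRILL_ERROR_RATE, TNT_ERROR_RATE)
    print(f"Training-time ratio (Brill / TnT): about {speedup:.0f}x")
    print(f"Relative error reduction (TnT vs. Brill): about {reduction:.1f}%")
```

Under these assumptions the Brill tagger trains several thousand times more slowly than TnT on the same 100,000-sentence set, and the best TnT error rate is roughly 30% lower in relative terms than the best Brill error rate.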
Similar resources
Probabilistic Arabic Part of Speech Tagger with Unknown Words Handling
A part-of-speech (POS) tagger is an essential preprocessing step in many natural language applications. In this paper, we investigate the best configuration of a trigram Hidden Markov Model (HMM) Arabic POS tagger when only a small tagged corpus is available. With small training data, unknown-word POS guessing is the main problem. This problem becomes more serious in languages which have huge size of voca...
Maximum Entropy Based Bengali Part of Speech Tagging
Part of Speech (POS) tagging can be described as a task of doing automatic annotation of syntactic categories for each word in a text document. This paper presents a POS tagger for Bengali using the statistical Maximum Entropy (ME) model. The system makes use of the different contextual information of the words along with the variety of features that are helpful in predicting the various POS cl...
Elastic neural networks for part of speech tagging
This paper presents a part of speech (POS) neuro tagger which consists of a 3-layer perceptron with elastic input. Computer experiments show that the neuro tagger has an accuracy of 94.4% for tagging ambiguous words when a small Thai corpus with 22,311 ambiguous words is used for training. A series of comparative experiments further show that the neuro tagger is definitely far superior to the st...
Improved Part-of-Speech Prediction in Suffix Analysis
MOTIVATION: Predicting the part of speech (POS) tag of an unknown word in a sentence is a significant challenge. This is particularly difficult in biomedicine, where POS tags serve as an input to training sophisticated literature summarization techniques, such as those based on Hidden Markov Models (HMM). Different approaches have been taken to deal with the POS tagger challenge, but with one ex...
Adaptación del Método de Etiquetado No Supervisado TBL
This paper proposes an improvement to Brill's “Transformation-Rule Based” POS tagger algorithm. Our improvement decreases training times considerably without affecting the accuracy of the algorithm.
Publication date: 2006